Compositions and methods for accurately identifying mutations

ABSTRACT

The present disclosure provides compositions and methods for accurately detecting mutations by uniquely tagging double stranded nucleic acid molecules with dual cyphers such that sequence data obtained from a sense strand can be linked to sequence data obtained from an anti-sense strand when sequenced, for example, by massively parallel sequencing methods.

CROSS REFERENCE TO RELATED APPLICATION

This application claims the benefit of U.S. provisional patent application Serial No. 61/600,535, filed Feb. 17, 2012, which is incorporated herein by reference in its entirety.

STATEMENT REGARDING SEQUENCE LISTING

The Sequence Listing associated with this application is provided in text format in lieu of a paper copy, and is hereby incorporated by reference into the specification. The name of the text file containing the Sequence Listing is 360056_409WO_SEQUENCE_LISTING.txt. The text file is 4 KB, was created on Feb. 14, 2013, and is being submitted electronically via EFS-Web.

BACKGROUND

1. Technical Field

The present disclosure relates to compositions and methods for accurately detecting mutations using sequencing and, more particularly, uniquely tagging double stranded nucleic acid molecules such that sequence data obtained for a sense strand can be linked to sequence data obtained from the anti-sense strand when obtained via massively parallel sequencing methods.

2. Description of Related Art

Detection of spontaneous mutations (e.g., substitutions, insertions, deletions, duplications), or even induced mutations, that occur randomly throughout a genome can be challenging because these mutational events are rare and may exist in one or only a few copies of DNA. The most direct way to detect mutations is by sequencing, but the available sequencing methods are not sensitive enough to detect rare mutations. For example, mutations that arise de novo in mitochondrial DNA (mtDNA) will generally only be present in a single copy of mtDNA, which means these mutations are not easily found since a mutation must be present in as much as 10-25% of a population of molecules to be detected by sequencing (Jones et al., Proc. Nat'l. Acad. Sci. U.S.A. 105:4283-88, 2008). As another example, the spontaneous somatic mutation frequency in genomic DNA has been estimated to be as low as 1×10⁻⁸ and 2.1×10⁻⁶ in human normal and cancerous tissues, respectively (Bielas et al., Proc. Nat'l Acad. Sci. U.S.A. 103:18238-42, 2006).

One improvement in sequencing has been to take individual DNA molecules and amplify the number of each molecule by, for example, polymerase chain reaction (PCR) and digital PCR. Indeed, massively parallel sequencing represents a particularly powerful form of digital PCR because multiple millions of template DNA molecules can be analyzed one by one. However, the amplification of single DNA molecules prior to or during sequencing by PCR and/or bridge amplification suffers from the inherent error rate of polymerases employed for amplification, and spurious mutations generated during amplification may be misidentified as spontaneous mutations from the original (endogenous unamplified) nucleic acid. Similarly, DNA templates damaged during preparation (ex vivo) may be amplified and incorrectly scored as mutations by massively parallel sequencing techniques. Again, using mtDNA as an example, experimentally determined mutation frequencies are strongly dependent on the accuracy of the particular assay being used (Kraytsberg et al., Methods 46:269-73, 2008); these discrepancies suggest that the spontaneous mutation frequency of mtDNA is either below, or very close to, the detection limit of these technologies. Massively parallel sequencing cannot generally be used to detect rare variants because of the high error rate associated with the sequencing process. One process using bridge amplification and sequencing by synthesis has shown an error rate that varies from about 0.06% to 1%, depending on various factors including read length, base-calling algorithms, and the type of variants detected (see Kinde et al., Proc. Nat'l. Acad. Sci. U.S.A. 108:9530-5, 2011).

BRIEF SUMMARY

In one aspect, the present disclosure provides a double-stranded nucleic acid molecule library that includes a plurality of target nucleic acid molecules and a plurality of random cyphers, wherein the nucleic acid library comprises molecules having a formula of X^(a)—X^(b)—Y, X^(b)—X^(a)—Y, Y—X^(a)—X^(b), Y—X^(b)—X^(a), X^(a)—Y—X^(b), or X^(b)—Y—X^(a) (in 5′ to 3′ order), wherein (a) X^(a) comprises a first random cypher, (b) Y comprises a target nucleic acid molecule, and (c) X^(b) comprises a second random cypher. Furthermore, each of the plurality of random cyphers comprises a length ranging from about 5 nucleotides to about 50 nucleotides (or about 5 nucleotides to about 10 nucleotides, or a length of about 6, about 7, about 8, about 9, about 10, about 11, about 12, about 13, about 14, about 15, about 16, about 17, about 18, about 19, or about 20 nucleotides).

In certain embodiments, the double-stranded sequences of the X^(a) and X^(b) cyphers are the same (e.g., X^(a) = X^(b)) for one or more target nucleic acid molecules, provided that each such target nucleic acid molecule does not have the same double-stranded cypher sequence as any other such target nucleic acid molecule. In certain other embodiments, the double-stranded sequence of the X^(a) cypher for each target nucleic acid molecule is different from the double-stranded sequence of the X^(b) cypher. In further embodiments, the double-stranded nucleic acid library is contained in a self-replicating vector, such as a plasmid, cosmid, YAC, or viral vector.

In a further aspect, the present disclosure provides a method for obtaining a nucleic acid sequence or accurately detecting a true mutation in a nucleic acid molecule by amplifying each strand of the aforementioned double-stranded nucleic acid library wherein a plurality of target nucleic acid molecules and plurality of random cyphers are amplified, and sequencing each strand of the plurality of target nucleic acid molecules and plurality of random cyphers. In certain embodiments, the sequencing is performed using massively parallel sequencing methods. In certain embodiments, the sequence of one strand of a target nucleic acid molecule associated with the first random cypher aligned with the sequence of the complementary strand associated with the second random cypher results in a measurable sequencing error rate ranging from about 10⁻⁶ to about 10⁻⁸.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a cartoon illustration of an exemplary vector of the present disclosure useful for generating a double-stranded nucleic acid library.

FIG. 2 is a cartoon illustration of an exemplary vector of the present disclosure, wherein adaptor sequences are included and are useful for, for example, bridge amplification methods before sequencing.

FIGS. 3A and 3B show characteristics of a cypher library and the detection of true mutations. (A) Data generated in a single next generation sequence run on MiSeq® demonstrate broad coverage and diversity at the upstream seven base pair cypher in a vector library, wherein the vector used is illustrated in FIG. 2. (B) Cypher Seq eliminates errors introduced during library preparation and sequencing. Target nucleic acid molecules were ligated into a cypher vector library containing previously catalogued dual, double-stranded cyphers. The target sequences were amplified and sequenced. All sequencing reads having identical cypher pairs, along with their reverse complements, were grouped into families. Comparison of family sequences allowed for generation of a consensus sequence wherein ‘mutations’ (errors) arising during library preparation (open circle) and during sequencing (gray circle and triangle) were computationally eliminated. Generally, mutations that are present in all or substantially all reads (black diamond) from the same cypher and its reverse complement are counted as true mutations.

FIGS. 4A and 4B show that the cypher system can distinguish true mutations from artifact mutations. Wild-type TP53 Exon 4 was ligated into a library of Cypher Seq vectors and sequenced on the Illumina MiSeq® instrument with a depth of over a million. Sequences were then compared to wild-type TP53 sequence. Detected substitutions were plotted before (A) and after (B) correction with Cypher Seq.

DETAILED DESCRIPTION

In one aspect, the present disclosure provides a double-stranded nucleic acid library wherein target nucleic acid molecules include dual cyphers (i.e., barcodes or origin identifier tags), one on each end (same or different), so that sequencing each complementary strand can be connected or linked back to the original molecule. The unique cypher on each strand links each strand with its original complementary strand (e.g., before any amplification), so that each paired sequence serves as its own internal control. In other words, by uniquely tagging double-stranded nucleic acid molecules, sequence data obtained from one strand of a single nucleic acid molecule can be specifically linked to sequence data obtained from the complementary strand of that same double-stranded nucleic acid molecule. Furthermore, sequence data obtained from one end of a double-stranded target nucleic acid molecule can be specifically linked to sequence data obtained from the opposite end of that same double-stranded target nucleic acid molecule (if, for example, it is not possible to obtain sequence data across the entire target nucleic acid molecule fragment of the library).

The compositions and methods of this disclosure allow a person of ordinary skill in the art to more accurately distinguish true mutations (i.e., naturally arising in vivo mutations) of a nucleic acid molecule from artifact “mutations” (i.e., ex vivo mutations or errors) of a nucleic acid molecule that may arise for various reasons, such as a downstream amplification error, a sequencing error, or physical or chemical damage. For example, if a mutation pre-existed in the original double-stranded nucleic acid molecule before isolation, amplification or sequencing, then a transition mutation of adenine (A) to guanine (G) identified on one strand will be complemented with a thymine (T) to cytosine (C) transition on the other strand. In contrast, artifact “mutations” that arise later on an individual (separate) DNA strand due to polymerase errors during isolation, amplification or sequencing are extremely unlikely to have a matched base change in the complementary strand. The approach of this disclosure provides compositions and methods for distinguishing systematic errors (e.g., polymerase read fidelity errors) and biological errors (e.g., chemical or other damage) from actual known or newly identified true mutations or single nucleotide polymorphisms (SNPs).
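The strand-matching rule in the preceding paragraph can be illustrated with a minimal Python sketch. The helper names `expected_partner_change` and `is_strand_concordant` are hypothetical and are not part of this disclosure; the sketch assumes both changes are reported at the same aligned position, each in the 5′ to 3′ sense of its own strand.

```python
# Illustrative sketch only; function names are hypothetical.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def expected_partner_change(ref_base, alt_base):
    """Return the (ref, alt) change expected on the complementary strand."""
    return COMPLEMENT[ref_base], COMPLEMENT[alt_base]

def is_strand_concordant(change_on_strand1, change_on_strand2):
    """True when strand 2 carries the complement of the change seen on strand 1."""
    ref, alt = change_on_strand1
    return expected_partner_change(ref, alt) == change_on_strand2

# A pre-existing A->G transition is matched by T->C on the other strand.
print(expected_partner_change("A", "G"))               # ('T', 'C')
# An artifact on one strand leaves the partner unchanged (T stays T).
print(is_strand_concordant(("A", "G"), ("T", "T")))    # False
```

A change that passes this check is a candidate true mutation; one that fails is consistent with an artifact arising on a single strand.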

In certain embodiments, the two cyphers on each target molecule have sequences that are distinct from each other and, therefore, provide a unique pair of identifiers wherein one cypher identifies (or is associated with) a first end of a target nucleic acid molecule and the second cypher identifies (or is associated with) the other end of the target nucleic acid molecule. In certain other embodiments, the two cyphers on each target molecule have the same sequence and, therefore, provide a unique identifier for each strand of the target nucleic acid molecule. Each strand of the double-stranded nucleic acid library (e.g., genomic DNA, cDNA) can be amplified and sequenced using, for example, next generation sequencing technologies (such as emulsion PCR or bridge amplification combined with pyrosequencing or sequencing by synthesis, or the like). The sequence information from each complementary strand of a first double-stranded nucleic acid molecule can be linked and compared (e.g., computationally “de-convoluted”) due to the unique cyphers associated with each end or strand of that particular double-stranded nucleic acid molecule. In other words, each original double-stranded nucleic acid molecule fragment found in a library of molecules can be individually reconstructed due to the presence of an associated unique barcode or pair of barcode (identifier tag) sequences on each target fragment or strand.
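One way to picture the computational “de-convolution” is the following Python sketch, which groups reads by cypher pair (treating a read and the read of its complementary strand as one family) and reports only changes present in substantially all reads of a family. It is an illustration under stated assumptions, not the disclosed implementation: the function names, the 0.9 threshold, the toy sequences, and the assumption that reads are already trimmed, aligned, and of equal length are all hypothetical.

```python
from collections import Counter, defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def family_key(first_cypher, second_cypher):
    """Give a read and the read of its complementary strand the same key."""
    as_read = (first_cypher, second_cypher)
    as_partner = (revcomp(second_cypher), revcomp(first_cypher))
    return min(as_read, as_partner)

def group_into_families(reads):
    """reads: iterable of (first_cypher, target, second_cypher) tuples."""
    families = defaultdict(list)
    for first, target, second in reads:
        key = family_key(first, second)
        # Flip partner-strand reads so every target shares the key's orientation.
        oriented = target if key == (first, second) else revcomp(target)
        families[key].append(oriented)
    return dict(families)

def call_mutations(family, reference, min_fraction=0.9):
    """Report (position, ref, alt) for changes seen in >= min_fraction of reads.

    Assumes the reference is given in the same orientation as the family key.
    """
    calls = []
    for pos, column in enumerate(zip(*family)):
        base, count = Counter(column).most_common(1)[0]
        if base != reference[pos] and count / len(column) >= min_fraction:
            calls.append((pos, reference[pos], base))
    return calls

# Toy data (hypothetical): a T->C change at position 3 on the original duplex,
# plus a G->T error at position 7 in a single read.
reference = "GATTACAGG"
xa, xb = "ACGTACG", "TTGCAAG"
mutant = "GATCACAGG"
top = [(xa, mutant, xb), (xa, "GATCACATG", xb)]
bottom = [(revcomp(xb), revcomp(mutant), revcomp(xa))] * 2
families = group_into_families(top + bottom)
for key, family in families.items():
    print(key, call_mutations(family, reference))
# ('ACGTACG', 'TTGCAAG') [(3, 'T', 'C')]
```

In this toy family the position-3 change appears in all four reads from both strands and is reported, whereas the single-read change at position 7 is discarded. A stricter variant could additionally require that reads from both strands be present in the family.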

By way of background, any spontaneous or induced mutation will be present in both strands of a native genomic, double-stranded DNA molecule. Hence, such a mutant DNA template amplified using PCR will result in a PCR product in which 100% of the molecules produced by PCR include the mutation. In contrast to an original, spontaneous mutation, a change due to polymerase error will only appear in one strand of the initial template DNA molecule (while the other strand will not have the artifact mutation). If all DNA strands in a PCR reaction are copied equally efficiently, then any polymerase error that emerges from the first PCR cycle likely will be found in about 25% of the total PCR product. But DNA molecules or strands are not copied equally efficiently, so DNA sequences amplified from the strand that incorporated an erroneous nucleotide base during the initial amplification might constitute more or less than 25% of the population of amplified DNA sequences depending on the efficiency of amplification, but still far less than 100%. Similarly, any polymerase error that occurs in later PCR cycles will generally represent an even smaller proportion of PCR products (i.e., 12.5% for the second cycle, 6.25% for the third, etc.) containing a “mutation.” PCR-induced mutations may be due to polymerase errors or due to the polymerase bypassing damaged nucleotides, thereby resulting in an error (see, e.g., Bielas and Loeb, Nat. Methods 2:285-90, 2005). For example, a common change to DNA is the deamination of cytosine, which is recognized by Taq polymerase as a uracil and results in a cytosine to thymine transition mutation (Zheng et al., Mutat. Res. 599:11-20, 2006). That is, an alteration in the original DNA sequence may be detected when the damaged DNA is sequenced, but such a change may or may not be recognized as a sequencing reaction error or due to damage arising ex vivo (e.g., during or after nucleic acid isolation).
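The percentages in the preceding paragraph follow from simple strand counting, sketched below in Python. The sketch assumes an idealized reaction that starts from a single double-stranded template, introduces one error on one newly made strand, and copies every strand with equal efficiency; the function name is hypothetical.

```python
def error_fraction(cycle):
    """Fraction of final strands carrying an error introduced in a given cycle.

    After `cycle` cycles a single duplex has become 2 ** (cycle + 1) strands,
    exactly one of which carries the new error; perfect doubling afterwards
    keeps that fraction constant.
    """
    if cycle < 1:
        raise ValueError("cycle must be 1 or greater")
    return 1 / 2 ** (cycle + 1)

for cycle in (1, 2, 3):
    print(cycle, f"{error_fraction(cycle):.2%}")
# 1 25.00%
# 2 12.50%
# 3 6.25%
```

A mutation present on both strands of the original template would instead appear in 100% of the product, which is the contrast the paragraph draws.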

Due to potential artifacts and alterations of nucleic acid molecules arising from isolation, amplification and sequencing, the accurate identification of true somatic DNA mutations is difficult when sequencing amplified nucleic acid molecules. Consequently, evaluation of whether certain mutations are related to, or are a biomarker for, various disease states (e.g., cancer) or aging becomes confounded.

Next generation sequencing has opened the door to sequencing multiple copies of an amplified single nucleic acid molecule, referred to as deep sequencing. The thought on deep sequencing is that if a particular nucleotide of a nucleic acid molecule is sequenced multiple times, then one can more easily identify rare sequence variants or mutations. In fact, however, the amplification and sequencing process has an inherent error rate (which may vary depending on DNA quality, purity, concentration (e.g., cluster density), or other conditions), so no matter how few or how many times a nucleic acid molecule is sequenced, a person of skill in the art cannot distinguish a polymerase error artifact from a true mutation (especially rare mutations).

While being able to sequence many different DNA molecules collectively is advantageous in terms of cost and time, the price for this efficiency and convenience is that various PCR errors complicate mutational analysis as long as their frequency is comparable to that of mutations arising in vivo. In other words, genuine in vivo mutations will be essentially indistinguishable from changes that are artifacts of PCR or sequencing errors.

Thus, the present disclosure, in a further aspect, provides methods for identifying mutations present before amplification or sequencing of a double-stranded nucleic acid library wherein the target molecules include a single double-stranded cypher or dual cyphers (i.e., barcodes or identifier tags), one on each end, so that sequencing each complementary strand can be connected back to the original molecule. In certain embodiments, the method enhances the sensitivity of the sequencing method such that the error rate is 5×10⁻⁶, 10⁻⁶, 5×10⁻⁷, 10⁻⁷, 5×10⁻⁸, 10⁻⁸ or less when sequencing many different target nucleic acid molecules simultaneously or such that the error rate is 5×10⁻⁷, 10⁻⁷, 5×10⁻⁸, 10⁻⁸ or less when sequencing a single target nucleic acid molecule in depth.

Prior to setting forth this disclosure in more detail, it may be helpful to an understanding thereof to provide definitions of certain terms to be used herein. Additional definitions are set forth throughout this disclosure.

In the present description, any concentration range, percentage range, ratio range, or integer range is to be understood to include the value of any integer within the recited range and, when appropriate, fractions thereof (such as one tenth and one hundredth of an integer), unless otherwise indicated. Also, any number range recited herein relating to any physical feature, such as polymer subunits, size or thickness, is to be understood to include any integer within the recited range, unless otherwise indicated. As used herein, the terms “about” and “consisting essentially of” mean ± 20% of the indicated range, value, or structure, unless otherwise indicated. It should be understood that the terms “a” and “an” as used herein refer to “one or more” of the enumerated components. The use of the alternative (e.g., “or”) should be understood to mean either one, both, or any combination thereof of the alternatives. As used herein, the terms “include,” “have” and “comprise” are used synonymously, which terms and variants thereof are intended to be construed as non-limiting.

As used herein, the term “random cypher” or “cypher” or “barcode” or “identifier tag” and variants thereof are used interchangeably and refer to a nucleic acid molecule having a length ranging from about 5 to about 50 nucleotides. In certain embodiments, all of the nucleotides of the cypher are not identical (i.e., comprise at least two different nucleotides) and optionally do not contain three contiguous nucleotides that are identical. In further embodiments, the cypher is comprised of about 5 to about 15 nucleotides, about 6 to about 10 nucleotides, and preferably about 7 to about 12 nucleotides. Cyphers will generally be located at one or both ends of a target molecule, and may be incorporated directly onto target molecules of interest or onto a vector into which target molecules will be later added.

As used herein, “target nucleic acid molecules” and variants thereof refer to a plurality of double-stranded nucleic acid molecules that may be fragments or shorter molecules generated from longer nucleic acid molecules, including from natural samples (e.g., a genome), or the target nucleic acid molecules may be synthetic (e.g., cDNA), recombinant, or a combination thereof. Target nucleic acid fragments from longer molecules may be generated using a variety of techniques known in the art, such as mechanical shearing or specific cleavage with restriction endonucleases. As used herein, a “nucleic acid molecule library” and variants thereof refer to a collection of nucleic acid molecules or fragments. In certain embodiments, the collection of nucleic acid molecules or fragments is incorporated into a vector, which can be transformed or transfected into an appropriate host cell. The target nucleic acid molecules of this disclosure may be introduced into a variety of different vector backbones (such as plasmids, cosmids, viral vectors, or the like) so that recombinant production of a nucleic acid molecule library can be maintained in a host cell of choice (such as bacteria, yeast, mammalian cells, or the like).

For example, a collection of nucleic acid molecules representing the entire genome is called a genomic library and a collection of DNA copies of messenger RNA is referred to as a complementary DNA (cDNA) library. Methods for introducing nucleic acid molecule libraries into vectors are well known in the art (see, e.g., Current Protocols in Molecular Biology, Ausubel et al., Eds., Greene Publishing and Wiley-Interscience, New York, 1995; Sambrook et al., Molecular Cloning: A Laboratory Manual, 2nd Ed., Cold Spring Harbor Laboratory Vols. 1-3, 1989; Methods in Enzymology, Vol. 152, Guide to Molecular Cloning Techniques, Berger and Kimmel, Eds., San Diego: Academic Press, Inc., 1987).

Depending on the type of library to be generated, the ends of the target nucleic acid fragments may have overhangs or may be “polished” (i.e., blunted). Together, the target nucleic acid molecule fragments can be, for example, cloned directly into a cypher vector to generate a vector library, or be ligated with adapters to generate, for example, polonies. The target nucleic acid molecules, which are the nucleic acid molecules of interest for amplification and sequencing, may range in size from a few nucleotides (e.g., 50) to many thousands (e.g., 10,000). Preferably, the target fragments in the library range in size from about 100 nucleotides to about 750 nucleotides or about 1,000 nucleotides, or from about 150 nucleotides to about 250 nucleotides or about 500 nucleotides.

As used herein, a “nucleic acid molecule priming site” or “PS” and variants thereof are short, known nucleic acid sequences contained in the vector. A PS sequence can vary in length from 5 nucleotides to about 50 nucleotides, or from about 10 nucleotides to about 30 nucleotides, and preferably is about 15 nucleotides to about 20 nucleotides in length. In certain embodiments, a PS sequence may be included at one or both ends or be an integral part of the random cypher nucleic acid molecules, or be included at one or both ends or be an integral part of an adapter sequence, or be included as part of the vector. A nucleic acid molecule primer that is complementary to a PS included in a library of the present disclosure can be used to initiate a sequencing reaction.

For example, if a random cypher only has a PS upstream (5′) of the cypher, then a primer complementary to the PS can be used to prime a sequencing reaction to obtain the sequence of the random cypher and some sequence of a target nucleic acid molecule cloned downstream of the cypher. In another example, if a random cypher has a first PS upstream (5′) and a second PS downstream (3′) of the cypher, then a primer complementary to the first PS can be used to prime a sequencing reaction to obtain the sequence of the random cypher, the second PS and some sequence of a target nucleic acid molecule cloned downstream of the second PS. In contrast, a primer complementary to the second PS can be used to prime a sequencing reaction to directly obtain the sequence of the target nucleic acid molecule cloned downstream of the second PS. In this latter case, more target molecule sequence information will be obtained since the sequencing reaction beginning from the second PS can extend further into the target molecule than does the reaction having to extend through both the cypher and the target molecule.

As used herein, “next generation sequencing” refers to high-throughput sequencing methods that allow the sequencing of thousands or millions of molecules in parallel. Examples of next generation sequencing methods include sequencing by synthesis, sequencing by ligation, sequencing by hybridization, polony sequencing, and pyrosequencing. By attaching primers to a solid substrate and a complementary sequence to a nucleic acid molecule, a nucleic acid molecule can be hybridized to the solid substrate via the primer and then multiple copies can be generated in a discrete area on the solid substrate by using polymerase to amplify (these groupings are sometimes referred to as polymerase colonies or polonies). Consequently, during the sequencing process, a nucleotide at a particular position can be sequenced multiple times (e.g., hundreds or thousands of times); this depth of coverage is referred to as “deep sequencing.”

As used herein, “base calling” refers to the computational conversion of raw or processed data from a sequencing instrument into quality scores and then actual sequences. For example, many of the sequencing platforms use optical detection and charge coupled device (CCD) cameras to generate images of intensity information (i.e., intensity information indicates which nucleotide is in which position of a nucleic acid molecule), so base calling generally refers to the computational image analysis that converts intensity data into sequences and quality scores. Another example is the Ion Torrent sequencing technology, which employs a proprietary semiconductor ion sensing technology to detect release of hydrogen ions during incorporation of nucleotide bases in sequencing reactions that take place in a high density array of micro-machined wells. There are other examples of methods known in the art that may be employed for simultaneous sequencing of large numbers of nucleic acid molecules. Various base calling methods are described in, for example, Niedringhaus et al. (Anal. Chem. 83:4327, 2011), which methods are herein incorporated by reference in their entirety.

In the following description, certain specific details are set forth in order to provide a thorough understanding of various embodiments of this disclosure. However, upon reviewing this disclosure, one skilled in the art will understand that the invention may be practiced without many of these details. In other instances, newly emerging next generation sequencing technologies, as well as well-known or widely available next generation sequencing methods (e.g., chain-termination sequencing, dye-terminator sequencing, reversible dye-terminator sequencing, sequencing by synthesis, sequencing by ligation, sequencing by hybridization, polony sequencing, pyrosequencing, ion semiconductor sequencing, nanoball sequencing, nanopore sequencing, single molecule sequencing, FRET sequencing, base-heavy sequencing, and microfluidic sequencing), have not all been described in detail to avoid unnecessarily obscuring the descriptions of the embodiments of the present disclosure. Descriptions of some of these methods, which methods are herein incorporated by reference in their entirety, can be found, for example, in PCT Publication Nos. WO 98/44151, WO 00/18957, and WO 2006/08413; and U.S. Pat. Nos. 6,143,496, 6,833,246, and 7,754,429; and U.S. Patent Application Publication Nos. U.S. 2010/0227329 and U.S. 2009/0099041.

Various embodiments of the present disclosure are described for purposes of illustration, in the context of use with vectors containing a library of nucleic acid fragments (e.g., genomic or cDNA library). However, as those skilled in the art will appreciate upon reviewing this disclosure, use with other nucleic acid libraries or methods for making a library of nucleic acid fragments may also be suitable.

In certain embodiments, a double-stranded nucleic acid library comprises a plurality of target nucleic acid molecules and a plurality of random cyphers, wherein the nucleic acid library comprises molecules having a formula of X^(a)—Y—X^(b) (in 5′ to 3′ order), wherein (a) X^(a) comprises a first random cypher, (b) Y comprises a target nucleic acid molecule, and (c) X^(b) comprises a second random cypher; wherein each of the plurality of random cyphers has a length of about 5 to about 50 nucleotides. In certain embodiments, the double-stranded sequence of the X^(a) cypher for each target nucleic acid molecule is different from the double-stranded sequence of the X^(b) cypher. In certain other embodiments, the double-stranded X^(a) cypher is identical to the X^(b) cypher for one or more target nucleic acid molecules, provided that the double-stranded cypher for each target nucleic acid molecule is different.

In further embodiments, the plurality or pool of random cyphers used in the double-stranded nucleic acid molecule library or vector library comprise from about 5 nucleotides to about 40 nucleotides, about 5 nucleotides to about 30 nucleotides, about 6 nucleotides to about 30 nucleotides, about 6 nucleotides to about 20 nucleotides, about 6 nucleotides to about 10 nucleotides, about 6 nucleotides to about 8 nucleotides, about 7 nucleotides to about 9 or about 10 nucleotides, or about 6, about 7 or about 8 nucleotides. In certain embodiments, a cypher preferably has a length of about 6, about 7, about 8, about 9, about 10, about 11, about 12, about 13, about 14, about 15, about 16, about 17, about 18, about 19, or about 20 nucleotides. In certain embodiments, a pair of random cyphers associated with nucleic acid sequences or vectors will have different lengths or have the same length. For example, a target nucleic acid molecule or vector may have an upstream (5′) first random cypher of about 6 nucleotides in length and a downstream (3′) second random cypher of about 9 nucleotides in length, or a target nucleic acid molecule or vector may have an upstream (5′) first random cypher of about 7 nucleotides in length and a downstream (3′) second random cypher of about 7 nucleotides in length.

In certain embodiments, the X^(a) cypher and the X^(b) cypher each comprise 6 nucleotides, 7 nucleotides, 8 nucleotides, 9 nucleotides, 10 nucleotides, 11 nucleotides, 12 nucleotides, 13 nucleotides, 14 nucleotides, 15 nucleotides, 16 nucleotides, 17 nucleotides, 18 nucleotides, 19 nucleotides, or 20 nucleotides. In certain other embodiments, the X^(a) cypher comprises 6 nucleotides and the X^(b) cypher comprises 7 nucleotides or 8 nucleotides; or the X^(a) cypher comprises 7 nucleotides and the X^(b) cypher comprises 6 nucleotides or 8 nucleotides; or the X^(a) cypher comprises 8 nucleotides and the X^(b) cypher comprises 6 nucleotides or 7 nucleotides; or the X^(a) cypher comprises 10 nucleotides and the X^(b) cypher comprises 11 nucleotides or 12 nucleotides.

The number of nucleotides contained in each of the random cyphers or barcodes will govern the total number of possible barcodes available for use in a library. Shorter barcodes allow for a smaller number of unique cyphers, which may be useful when performing a deep sequence of one or a few nucleotide sequences, whereas longer barcodes may be desirable when examining a population of nucleic acid molecules, such as cDNAs or genomic fragments. In certain embodiments, multiplex sequencing may be desired when targeting specific nucleic acid molecules, specific genomic regions, smaller genomes, or a subset of cDNA transcripts. Multiplex sequencing involves amplifying two or more samples that have been pooled into, for example, a single lane of a flow cell for bridge amplification to exponentially increase the number of molecules analyzed in a single run without sacrificing time or cost. In related embodiments, a unique index sequence (comprising a length ranging from about 4 nucleotides to about 25 nucleotides) specific for a particular sample is included with each dual cypher vector library. For example, if ten different samples are being pooled in preparation for multiplex sequencing, then ten different index sequences will be used such that ten dual cypher vector libraries are used in which each library has a single, unique index sequence identifier (but each library has a plurality of random cyphers).
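As a hedged illustration of how pooled reads might be sorted back to their samples by index sequence, consider the Python sketch below. The position of the index at the start of each read, the 4-nucleotide index values, the sample names, and the `demultiplex` function are assumptions made for this example only; the disclosure does not specify them.

```python
from collections import defaultdict

def demultiplex(reads, index_to_sample, index_length):
    """Bin reads by a leading sample index (index position is an assumption)."""
    bins = defaultdict(list)
    for read in reads:
        index, remainder = read[:index_length], read[index_length:]
        # Reads whose index matches no known sample are kept aside, not guessed.
        sample = index_to_sample.get(index, "unassigned")
        bins[sample].append(remainder)
    return dict(bins)

# Hypothetical indexes for two pooled dual cypher vector libraries.
index_to_sample = {"ACGT": "sample_1", "TGCA": "sample_2"}
pooled = ["ACGTGGATCCA", "TGCATTAGGCA", "ACGTCCATGGA", "GGGGAAAACCC"]
print(demultiplex(pooled, index_to_sample, index_length=4))
```

Within each sample bin, the random cyphers then identify individual molecules, so the index and the cyphers serve separate purposes.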

For example, a barcode of 7 nucleotides would have a formula of 5′-NNNNNNN-3′ (SEQ ID NO.:1), wherein N may be any naturally occurring nucleotide. The four naturally occurring nucleotides are A, T, C, and G, so the total number of possible random cyphers is 4⁷, or 16,384 possible random arrangements (i.e., 16,384 different or unique cyphers). For 6 and 8 nucleotide barcodes, the number of random cyphers would be 4,096 and 65,536, respectively. In certain embodiments of 6, 7 or 8 random nucleotide cyphers, there may be fewer than the pool of 4,096, 16,384 or 65,536 unique cyphers, respectively, available for use when excluding, for example, sequences in which all the nucleotides are identical (e.g., all A or all T or all C or all G) or when excluding sequences in which three contiguous nucleotides are identical or when excluding both of these types of molecules. In addition, the first about 5 nucleotides to about 20 nucleotides of the target nucleic acid molecule sequence may be used as a further identifier tag together with the sequence of an associated random cypher.
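The pool sizes stated above, and the effect of the optional exclusions, can be checked by brute-force enumeration. The Python sketch below is illustrative only; the function and parameter names are hypothetical.

```python
from itertools import product

def has_triple_run(seq):
    """True if any three contiguous nucleotides are identical."""
    return any(seq[i] == seq[i + 1] == seq[i + 2] for i in range(len(seq) - 2))

def count_cyphers(length, exclude_homopolymers=False, exclude_triple_runs=False):
    """Count cyphers of a given length that survive the chosen exclusions."""
    count = 0
    for bases in product("ACGT", repeat=length):
        seq = "".join(bases)
        if exclude_homopolymers and len(set(seq)) == 1:
            continue  # all nucleotides identical, e.g. AAAAAAA
        if exclude_triple_runs and has_triple_run(seq):
            continue  # contains a run such as GGG
        count += 1
    return count

for n in (6, 7, 8):
    print(n, count_cyphers(n))
# 6 4096
# 7 16384
# 8 65536
print(count_cyphers(7, exclude_homopolymers=True))
# 16380
```

Excluding sequences with three identical contiguous nucleotides removes considerably more candidates than excluding homopolymers alone, which is worth weighing when choosing a cypher length for a given library size.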

In still further embodiments, a double-stranded nucleic acid library comprises a plurality of target nucleic acid molecules and a plurality of random cyphers, wherein the nucleic acid library comprises molecules having a formula of X^(a)—Y—X^(b) (in 5′ to 3′ order), wherein (a) X^(a) comprises a first random cypher, (b) Y comprises a target nucleic acid molecule, and (c) X^(b) comprises a second random cypher; wherein each of the plurality of random cyphers has a length of about 5 to about 50 nucleotides and wherein (i) at least two of those nucleotides are different in each cypher or (ii) each cypher does not contain three contiguous nucleotides that are identical. In certain embodiments wherein each cypher does not contain three contiguous nucleotides that are identical, the double-stranded X^(a) cypher is identical to the X^(b) cypher for one or more target nucleic acid molecules, provided that the double-stranded cypher for each target nucleic acid molecule is different.

In yet further embodiments, a double-stranded nucleic acid library comprises a plurality of target nucleic acid molecules and a plurality of random cyphers, wherein the nucleic acid library comprises molecules having a formula of X^(a)—X^(b)—Y, X^(b)—X^(a)—Y, Y—X^(a)—X^(b), Y—X^(b)—X^(a), X^(a)—Y, X^(b)—Y, Y—X^(a), or Y—X^(b) (in 5′ to 3′ order), wherein (a) X^(a) comprises a first random cypher, (b) Y comprises a target nucleic acid molecule, and (c) X^(b) comprises a second random cypher; wherein each of the plurality of random cyphers has a length of about 5 to about 50 nucleotides.

In any of the embodiments described herein, an X^(a) cypher further comprises about a 5 nucleotide to about a 20 nucleotide sequence of the target nucleic acid molecule that is downstream of the X^(a) cypher, or an X^(b) cypher further comprises about a 5 nucleotide to about a 20 nucleotide sequence of the target nucleic acid molecule that is upstream of the X^(b) cypher, or an X^(a) cypher and X^(b) cypher further comprise about a 5 nucleotide to about a 20 nucleotide sequence of the target nucleic acid molecule that is downstream or upstream, respectively, of each cypher.

In yet further embodiments, a first target molecule is associated with and disposed between a first random cypher X^(a) and a second random cypher X^(b), a second target molecule is associated with and disposed between a third random cypher X^(a) and a fourth random cypher X^(b), and so on, wherein the target molecules of a library or of a vector library each has a unique X^(a) cypher (i.e., none of the X^(a) cyphers have the same sequence) and each has a unique X^(b) cypher (i.e., none of the X^(b) cyphers have the same sequence), and wherein none or only a minority of the X^(a) and X^(b) cyphers have the same sequence.

For example, if the length of the random cypher is 7 nucleotides, then there will be a total of 16,384 different barcodes available as first random cypher X^(a) and second random cypher X^(b). In this case, if a first target nucleic acid molecule is associated with and disposed between random cypher X^(a) number 1 and random cypher X^(b) number 2 and a second target nucleic acid molecule is associated with and disposed between random cypher X^(a) number 16,383 and random cypher X^(b) number 16,384, then a third target nucleic acid molecule can only be associated with and disposed between any pair of random cypher numbers selected from numbers 3 to 16,382, and so on for each target nucleic acid molecule of a library until each of the different random cyphers has been used (which may or may not be all 16,384). In this embodiment, each target nucleic acid molecule of a library will have a unique pair of cyphers that differ from each of the other pairs of cyphers found associated with each other target nucleic acid molecule of the library.

In any of the embodiments described herein, random cypher sequences from a particular pool of cyphers (e.g., pools of 4,096, 16,384 or 65,536 unique cyphers) may be used more than once. In further embodiments, each target nucleic acid molecule or a subset of target molecules has a different (unique) pair of cyphers. For example, if a first target molecule is associated with and disposed between random cypher number 1 and random cypher number 100, then a second target molecule will need to be flanked by a different dual pair of cyphers (such as random cypher number 1 and random cypher number 65, or random cypher number 486 and random cypher number 100), which may be any combination other than 1 and 100. In certain other embodiments, each target nucleic acid molecule or a subset of target molecules has identical cyphers on each end, provided that the double-stranded cypher for each target nucleic acid molecule is different. For example, if a first target molecule is flanked by cypher number 10, then a second target molecule having identical cyphers on each end will have to have a different cypher (such as random cypher number 555 or the like), which may be any cypher other than 10. In still further embodiments, target nucleic acid molecules of the nucleic acid molecule library will each have dual unique cyphers X^(a) and X^(b), wherein none of the X^(a) cyphers have the same sequence as any other X^(a) cypher, none of the X^(b) cyphers have the same sequence as any other X^(b) cypher, and none of the X^(a) cyphers have the same sequence as any X^(b) cypher. In still further embodiments, target nucleic acid molecules of the nucleic acid molecule library will each have a unique pair of X^(a)—X^(b) cyphers wherein none of the X^(a) or X^(b) cyphers have the same sequence. A mixture of any of the aforementioned embodiments may make up a nucleic acid molecule library of this disclosure.
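The pair-uniqueness requirement in the example above (an individual cypher may recur, but no two target molecules may share the same X^(a)/X^(b) combination) can be illustrated with a short check. This is a sketch under assumed names: `duplicated_pairs` and the target labels are hypothetical, and the cypher numbers are the ones used in the example.

```python
from collections import Counter

def duplicated_pairs(assignments):
    """Return the set of (Xa, Xb) cypher pairs shared by more than one target.

    `assignments` maps a target label to its (Xa number, Xb number) pair.
    An empty result means every target has a unique pair, even when an
    individual cypher number is reused across targets.
    """
    counts = Counter(assignments.values())
    return {pair for pair, n in counts.items() if n > 1}

# Cypher 1 and cypher 100 are each reused, but every pair is distinct.
library = {"target_1": (1, 100), "target_2": (1, 65), "target_3": (486, 100)}
print(duplicated_pairs(library))

# A fourth target reusing the pair (1, 100) violates the requirement.
library["target_4"] = (1, 100)
print(duplicated_pairs(library))
```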

In any of the embodiments described herein, the plurality of target nucleic acid molecules that together are used to generate a nucleic acid molecule library (or used for insertion into a vector to generate a vector library containing a plurality of target nucleic acid molecules) may each have a length that ranges from about 10 nucleotides to about 10,000 nucleotides, from about 50 nucleotides to about 5,000 nucleotides, from about 100 nucleotides to about 1,000 nucleotides, or from about 150 nucleotides to about 750 nucleotides, or from about 250 nucleotides to about 500 nucleotides.

In any of the embodiments described herein, the plurality of random cyphers may further be linked to a first nucleic acid molecule priming site (PS1), linked to a second nucleic acid molecule priming site (PS2), or linked to both a first and a second nucleic acid molecule priming site. In certain embodiments, a plurality of random cyphers may each be associated with and disposed between a first nucleic acid molecule priming site (PS1) and a second nucleic acid molecule priming site (PS2), wherein the double-stranded sequence of PS1 is different from the double-stranded sequence of PS2. In certain embodiments, each pair of X^(a)—X^(b) cyphers may be associated with and disposed between an upstream and a downstream nucleic acid molecule priming site (PS1) (see, e.g., FIG. 2).

In any of the embodiments described herein, a first nucleic acid molecule priming site PS1 will be located upstream (5′) of the first random cypher X^(a) and the first nucleic acid molecule priming site PS1 will also be located downstream (3′) of the second random cypher X^(b). In certain embodiments, an oligonucleotide primer complementary to the sense strand of PS1 can be used to prime a sequencing reaction to obtain the sequence of the sense strand of the first random cypher X^(a) or to prime a sequencing reaction to obtain the sequence of the anti-sense strand of the second random cypher X^(b), whereas an oligonucleotide primer complementary to the anti-sense strand of PS1 can be used to prime a sequencing reaction to obtain the sequence of the anti-sense strand of the first random cypher X^(a) or to prime a sequencing reaction to obtain the sequence of the sense strand of the second random cypher X^(b).

In any of the embodiments described herein, the second nucleic acid molecule priming site PS2 will be located downstream (3′) of the first random cypher X^(a) and the second nucleic acid molecule priming site PS2 will also be located upstream (5′) of the second random cypher X^(b). In certain embodiments, an oligonucleotide primer complementary to the sense strand of PS2 can be used to prime a sequencing reaction to obtain the sequence of the sense strand from the 5′-end of the associated double-stranded target nucleic acid molecule or to prime a sequencing reaction to obtain the sequence of the anti-sense strand from the 3′-end of the associated double-stranded target nucleic acid molecule, whereas an oligonucleotide primer complementary to the anti-sense strand of PS2 can be used to prime a sequencing reaction to obtain the sequence of the anti-sense strand from the 5′-end of the associated double-stranded target nucleic acid molecule or to prime a sequencing reaction to obtain the sequence of the sense strand from the 3′-end of the associated double-stranded target nucleic acid molecule.

Depending on the length of the target nucleic acid molecule, the entire target nucleic acid molecule sequence may be obtained if it is short enough or only a portion of the entire target nucleic acid molecule sequence may be obtained if it is longer than about 100 nucleotides to about 250 nucleotides. An advantage of the compositions and methods of the instant disclosure is that even though a target nucleic acid molecule is too long to obtain sequence data for the entire molecule or fragment, the sequence data obtained from one end of a double-stranded target molecule can be specifically linked to sequence data obtained from the opposite end or from the second strand of that same double-stranded target molecule because each target molecule in a library of this disclosure will have double-stranded cyphers, or a unique X^(a)—X^(b) pair of cyphers. Linking the sequence data of the two strands allows for sensitive identification of “true” mutations wherein deeper sequencing actually increases the sensitivity of the detection, and these methods can provide sufficient data to quantify the number of artifact mutations.

In any of the embodiments described herein, a plurality of random cyphers may further comprise a first restriction endonuclease recognition sequence (RE1) and a second restriction endonuclease recognition sequence (RE2), wherein the first restriction endonuclease recognition sequence RE1 is located upstream (5′) of the first random cypher X^(a) and the second restriction endonuclease recognition sequence RE2 is located downstream (3′) of the second random cypher X^(b). In certain embodiments, a first restriction endonuclease recognition sequence RE1 and a second restriction endonuclease recognition sequence RE2 are the same or different. In certain embodiments, RE1, RE2, or both RE1 and RE2 are “rare-cutter” restriction endonucleases that have a recognition sequence that occurs only rarely within a genome or within a target nucleic acid molecule sequence or are “blunt-cutters” that generate nucleic acid molecules with blunt ends after digestion (e.g., SmaI). Such rare cutter enzymes generally have longer recognition sites with seven- or eight-nucleotide or longer recognition sequences, such as AarI, AbeI, AscI, AsiSI, BbvCI, BstRZ246I, BstSWI, CciNI, CsiBI, CspBI, FseI, NotI, MchAI, MspSWI, MssI, PacI, PmeI, SbfI, SdaI, SgfI, SmiI, SrfI, Sse232I, Sse8387I, SwaI, TaqII, VpaK32I, or the like.

In certain embodiments, a nucleic acid molecule library comprises nucleic acid molecules having a formula of 5′-RE1-PS1-X^(a)-PS2-Y-PS2-X^(b)-PS1-RE2-3′, wherein RE1 is a first restriction endonuclease recognition sequence, PS1 is a first nucleic acid molecule priming site, PS2 is a second nucleic acid molecule priming site, RE2 is a second restriction endonuclease recognition sequence, Y comprises a target nucleic acid molecule, and X^(a) and X^(b) are cyphers comprising a length ranging from about 5 nucleotides to about 50 nucleotides or about 6 nucleotides to about 15 nucleotides or about 7 nucleotides to about 9 nucleotides. In further embodiments, RE1 and RE2 are sequences recognized by the same restriction endonuclease or an isoschizomer or neoschizomer thereof, or RE1 and RE2 have different sequences recognized by different restriction endonucleases. In further embodiments, PS1 and PS2 have different sequences. In further embodiments, target nucleic acid molecules of the nucleic acid molecule library will each have dual unique cyphers X^(a) and X^(b), wherein none of the X^(a) cyphers have the same sequence as any other X^(a) cypher, none of the X^(b) cyphers have the same sequence as any other X^(b) cypher, and none of the X^(a) cyphers have the same sequence as any X^(b) cypher. In still further embodiments, target nucleic acid molecules of the nucleic acid molecule library will each have a unique cypher or pair of X^(a)—X^(b) cyphers wherein none of the X^(a) or X^(b) cyphers have the same sequence.

Also contemplated in the present disclosure is using a library of double-stranded barcoded or dual double-stranded barcoded target nucleic acid molecules for amplification and sequencing reactions to detect true mutations. In order to facilitate certain amplification or sequencing methods, other features may be included in the compositions of the instant disclosure. For example, bridge amplification may involve ligating adapter sequences to each end of a population of target nucleic acid molecules. Single-stranded oligonucleotide primers complementary to the adapters are immobilized on a solid substrate, and the target molecules containing the adapter sequences are denatured into single strands and hybridized to complementary primers on the solid substrate. An extension reaction is used to copy the hybridized target molecule and the double-stranded product is denatured into single strands again. The copied single strands then loop over (form a “bridge”) and hybridize with a complementary primer on the solid substrate, upon which the extension reaction is run again. In this way, many target molecules may be amplified at the same time and the resulting product is subject to massively parallel sequencing.

In certain embodiments, a nucleic acid molecule library comprises nucleic acid molecules having a formula of 5′-RE1-AS-PS1-X^(a)-PS2-Y-PS2-X^(b)-PS1-AS-RE2-3′, wherein RE1 and RE2 are first and second restriction endonuclease recognition sequences, PS1 and PS2 are first and second nucleic acid molecule priming sites, AS is an adapter sequence comprising a length ranging from about 20 nucleotides to about 100 nucleotides, Y comprises a target nucleic acid molecule, and X^(a) and X^(b) are cyphers comprising a length ranging from about 5 nucleotides to about 50 nucleotides or about 6 nucleotides to about 15 nucleotides or about 7 nucleotides to about 9 nucleotides.

In further embodiments, a nucleic acid molecule library comprises nucleic acid molecules having a formula of 5′-RE1-AS-PS1-X^(a)—Y—X^(b)-PS1-AS-RE2-3′, wherein RE1 and RE2 are first and second restriction endonuclease recognition sequences, PS1 is a first nucleic acid molecule priming site, AS is an adapter sequence comprising a length ranging from about 20 nucleotides to about 100 nucleotides, Y comprises a target nucleic acid molecule, and X^(a) and X^(b) are cyphers comprising a length ranging from about 5 nucleotides to about 50 nucleotides or about 6 nucleotides to about 15 nucleotides or about 7 nucleotides to about 9 nucleotides. In related embodiments, the AS adapter sequence of the aforementioned vector may further comprise a PS2 that is a second nucleic acid molecule priming site or the PS2 may be a part of the original AS sequence. In still further embodiments, the nucleic acid molecule library may further comprise an index sequence (comprising a length ranging from about 4 nucleotides to about 25 nucleotides) located between each of the first and second AS and the PS1 so that the library can be pooled with other libraries having different index sequences to facilitate multiplex sequencing (also referred to as multiplexing) either before or after amplification.

Each of the aforementioned dual barcoded target nucleic acid molecules may be assembled into a carrier library in the form of, for example, a self-replicating vector, such as a plasmid, cosmid, YAC, viral vector or other vectors known in the art. In certain embodiments, any of the aforementioned double-stranded nucleic acid molecules comprising a plurality of target nucleic acid molecules and a plurality of random cyphers are contained in a vector. In still further embodiments, such a vector library is carried in a host cell, such as bacteria, yeast, or mammalian cells.

The present disclosure also provides vectors useful for generating a library of dual barcoded target nucleic acid molecules according to this disclosure. Exemplary vectors comprising cyphers and other elements of this disclosure are illustrated in FIGS. 1 and 2.

In certain embodiments, there are provided a plurality of nucleic acid vectors, comprising a plurality of random cyphers, wherein each vector comprises a region having a formula of 5′-RE1-PS1-X^(a)-PS2-RE3-PS2-X^(b)-PS1-RE2-3′, wherein (a) RE1 is a first restriction endonuclease recognition sequence, (b) PS1 is a first nucleic acid molecule priming site, (c) X^(a) comprises a first random cypher, (d) RE3 is a third restriction endonuclease recognition sequence, wherein RE3 is a site into which a target nucleic acid molecule can be inserted, (e) X^(b) comprises a second random cypher, (f) PS2 is a second nucleic acid molecule priming site, and (g) RE2 is a second restriction endonuclease recognition sequence; and wherein each of the plurality of random cyphers comprises a length ranging from about 5 nucleotides to about 50 nucleotides, preferably from about 7 nucleotides to about 9 nucleotides; and wherein the plurality of nucleic acid vectors are useful for preparing a double-stranded nucleic acid molecule library in which each vector has a different target nucleic acid molecule insert. In certain embodiments, the sequence of the X^(a) cypher is different from the sequence of the X^(b) cypher in each vector (that is, each vector has a unique pair). In further embodiments, the plurality of nucleic acid vectors may further comprise at least one adapter sequence (AS) between RE1 and PS1 and at least one AS between PS1 and RE2, or comprise at least one AS between RE1 and X^(a) cypher and at least one AS between X^(b) cypher and RE2, wherein the AS optionally has a priming site.

In further vector embodiments, the plurality of random cyphers can each have the same or different number of nucleotides, and comprise from about 6 nucleotides to about 8 nucleotides to about 10 nucleotides to about 12 nucleotides to about 15 nucleotides. In still other embodiments, a plurality of target nucleic acid molecules comprising from about 10 nucleotides to about 10,000 nucleotides or comprising from about 100 nucleotides to about 750 nucleotides or to about 1,000 nucleotides, may be inserted into the vector at RE3. In certain embodiments, RE3 will cleave DNA into blunt ends and the plurality of target nucleic acid molecules ligated into this site will also be blunt-ended.

In certain embodiments of the plurality of nucleic acid vectors wherein each vector comprises a region having a formula of 5′-RE1-PS1-X^(a)-PS2-RE3-PS2-X^(b)-PS1-RE2-3′, the X^(a) cyphers and X^(b) cyphers on each vector are sequenced before a target nucleic acid molecule is inserted into each vector. In further embodiments of the plurality of nucleic acid vectors wherein each vector comprises a region having a formula of 5′-RE1-PS1-X^(a)-PS2-RE3-PS2-X^(b)-PS1-RE2-3′, the X^(a) cyphers and X^(b) cyphers on each vector are sequenced after a target nucleic acid molecule is inserted into each vector or are sequenced at the same time a target nucleic acid molecule insert is sequenced.

The dual barcoded target nucleic acid molecules and the vectors containing such molecules of this disclosure may further be used in sequencing reactions to determine the sequence and mutation frequency of the molecules in the library. In certain embodiments, this disclosure provides a method for obtaining a nucleic acid sequence by preparing a double-stranded dual barcoded nucleic acid library as described herein and then sequencing each strand of the plurality of target nucleic acid molecules and plurality of random cyphers. In certain embodiments, target nucleic acid molecules and associated cyphers are excised for sequencing directly from the vector using restriction endonuclease enzymes prior to amplification. In certain embodiments, next generation sequencing methods are used to determine the sequence of library molecules, such as sequencing by synthesis, pyrosequencing, reversible dye-terminator sequencing or polony sequencing.

In still further embodiments, there are provided methods for determining the error rate due to amplification and sequencing by determining the sequence of one strand of a target nucleic acid molecule associated with the first random cypher and aligning with the sequence of the complementary strand associated with the second random cypher to distinguish between a pre-existing mutation and an amplification or sequencing artifact mutation, wherein the measured sequencing error rate will range from about 10⁻⁶ to about 5×10⁻⁶ to about 10⁻⁷ to about 5×10⁻⁷ to about 10⁻⁸ to about 10⁻⁹. In other words, using the methods of this disclosure, a person of ordinary skill in the art can associate each DNA sequence read to an original template DNA. Given that both strands of the original double-stranded DNA are barcoded with associated barcodes, this increases the sensitivity of the sequencing base call by more easily identifying artifact “mutations” (i.e., sequence changes introduced during the amplification or sequencing process).
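The strand comparison described above can be sketched as follows. This is an illustrative simplification, not the disclosed implementation: the function name `classify_variants` is hypothetical, and the sketch assumes that the sense and anti-sense reads have already been linked through their cyphers, aligned to the reference, trimmed to the same length, and that the anti-sense read has been reverse-complemented into the sense orientation.

```python
def classify_variants(reference, sense, antisense):
    """Split mismatches into strand-concordant changes and artifacts.

    All three arguments are equal-length strings in the same orientation
    (assumption: the anti-sense read was reverse-complemented beforehand).
    A change seen identically on both strands is treated as a pre-existing
    mutation; a change seen on only one strand is treated as an
    amplification or sequencing artifact.
    """
    true_mutations = []
    artifacts = []
    for pos, (ref, s, a) in enumerate(zip(reference, sense, antisense)):
        if s == ref and a == ref:
            continue  # both strands agree with the reference
        if s == a:
            true_mutations.append((pos, ref, s))
        else:
            artifacts.append((pos, ref, s, a))
    return true_mutations, artifacts

mutations, errors = classify_variants("ACGTAC", "ACTTAC", "ACTTAG")
print(mutations)  # position 2 changed on both strands
print(errors)     # position 5 changed on one strand only
```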

In certain embodiments, the compositions and methods of this instant disclosure will be useful in detecting rare mutants against a large background signal, such as when monitoring circulating tumor cells, detecting circulating mutant DNA in blood, monitoring or detecting disease and rare mutations by direct sequencing, or monitoring or detecting disease or drug response associated mutations. Additional embodiments may be used to quantify DNA damage, quantify or detect mutations in viral genomes (e.g., HIV and other viral infections) or other infectious agents that may be indicative of response to therapy or may be useful in monitoring disease progression or recurrence. In yet other embodiments, these compositions and methods may be useful in detecting damage to DNA from chemotherapy, or in detection and quantitation of specific methylation of DNA sequences.

EXAMPLES

Example 1 Dual Cypher Sequencing of a Tumor Genomic Library

Cancer cells contain numerous clonal mutations, i.e., mutations that are present in most or all malignant cells of a tumor and have presumably been selected because they confer a proliferative advantage. An important question is whether cancer cells also contain a large number of random mutations, i.e., randomly distributed unselected mutations that occur in only one or a few cells of a tumor. Such random mutations could contribute to the morphologic and functional heterogeneity of cancers and include mutations that confer resistance to therapy. The instant disclosure provides compositions and methods for distinguishing clonal mutations from random mutations.

To examine whether malignant cells exhibit a mutator phenotype resulting in the generation of random mutations throughout the genome, dual cypher sequencing of the present disclosure will be performed on normal and tumor genomic libraries. Briefly, genomic DNA from patient-matched normal and tumor tissue is prepared using Qiagen® kits (Valencia, Calif.), and quantified by optical absorbance and quantitative PCR (qPCR). The isolated genomic DNA is fragmented to a size of about 150-250 base pairs (short insert library) or to a size of about 300-700 base pairs (long insert library) by shearing. The DNA fragments having overhang ends are repaired (i.e., blunted) using T4 DNA polymerase (having both 3′ to 5′ exonuclease activity and 5′ to 3′ polymerase activity) and the 5′-ends of the blunted DNA are phosphorylated with T4 polynucleotide kinase (Quick Blunting Kit I, New England Biolabs), and then purified. The end-repaired DNA fragments are ligated into the SmaI site of the library of dual cypher vectors shown in FIG. 2 to generate a target genomic library.

The ligated cypher vector library is purified and the target genomic library fragments are amplified by using, for example, the following PCR protocol: 30 seconds at 98° C.; five to thirty cycles of 10 seconds at 98° C., 30 seconds at 65° C., 30 seconds at 72° C.; 5 minutes at 72° C.; and then store at 4° C. The amplification is performed using sense strand and anti-sense strand primers that anneal to a sequence that is located within the adapter region (in certain embodiments, the primer will anneal to a sequence upstream of the AS) and that is upstream of the unique cypher and the target genomic insert (and, if present, upstream of an index sequence if multiplex sequencing is desired; see, e.g., FIG. 2) for Illumina bridge sequencing. The sequencing of the library described above will be performed using, for example, an Illumina® Genome Analyzer II sequencing instrument as specified by the manufacturer.

The unique cypher tags are used to computationally deconvolute the sequencing data and map all sequence reads to single molecules (i.e., distinguish PCR and sequencing errors from real mutations). Base calling and sequence alignment will be performed using, for example, the Eland pipeline (Illumina, San Diego, Calif.). The data generated will allow identification of tumor heterogeneity at the single-nucleotide level and reveal tumors having a mutator phenotype.
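One way the deconvolution step could be organized is to group reads into families by their cypher pair and take a per-position majority vote within each family. The sketch below is an assumption-laden illustration rather than the pipeline actually used: `family_consensus` is a hypothetical name, reads are assumed to be pre-aligned and of equal length within a family, and a simple majority vote stands in for whatever consensus rule a production analysis would apply.

```python
from collections import Counter, defaultdict

def family_consensus(reads):
    """Group reads by (Xa, Xb) cypher pair and build a majority consensus.

    `reads` is an iterable of (xa, xb, sequence) tuples. Reads sharing a
    cypher pair are assumed to derive from the same original molecule and
    to be aligned and equal in length (simplifying assumption).
    """
    families = defaultdict(list)
    for xa, xb, seq in reads:
        families[(xa, xb)].append(seq)

    consensus = {}
    for pair, seqs in families.items():
        # zip(*seqs) yields one column of bases per read position.
        consensus[pair] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus

reads = [
    ("AAACCCG", "TTTGGGA", "ACGT"),
    ("AAACCCG", "TTTGGGA", "ACGT"),
    ("AAACCCG", "TTTGGGA", "ACTT"),  # isolated error, outvoted in its family
    ("GGGTTTA", "CCCAAAT", "TTTT"),
]
print(family_consensus(reads))
```

An isolated PCR or sequencing error appears in a minority of the reads within a family and is outvoted, whereas a change present in the original molecule appears throughout the family.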

Example 2 Dual Cypher Sequencing of a mtDNA Library

Mutations in mitochondrial DNA (mtDNA) lead to a diverse collection of diseases that are challenging to diagnose and treat. Each human cell has hundreds to thousands of mitochondrial genomes and disease-associated mtDNA mutations are homoplasmic in nature, i.e., the identical mutation is present in a preponderance of mitochondria within a tissue (Taylor and Turnbull, Nat. Rev. Genet. 6:389, 2005; Chatterjee et al., Oncogene 25:4663, 2006). Although the precise mechanisms of mtDNA mutation accumulation in disease pathogenesis remain elusive, multiple homoplasmic mutations have been documented in colorectal, breast, cervical, ovarian, prostate, liver, and lung cancers (Copeland et al., Cancer Invest. 20:557, 2002; Brandon et al., Oncogene 25:4647, 2006). Hence, the mitochondrial genome provides excellent potential as a specific biomarker of disease, which may allow for improved treatment outcomes and increased overall survival.

Dual cypher sequencing of the present disclosure can be leveraged to quantify circulating tumor cells (CTCs) and circulating tumor mtDNA (ctmtDNA), which could be used to diagnose and stage cancer, assess response to therapy, and evaluate progression and recurrence after surgery. First, mtDNA isolated from prostatic cancer and peripheral blood cells from the same patient will be sequenced to identify somatic homoplasmic mtDNA mutations. These mtDNA biomarkers will be statistically assessed for their potential fundamental and clinical significance with respect to Gleason score, clinical stage, recurrence, therapeutic response, and progression.

Once specific homoplasmic mutations from individual tumors are identified, patient-matched blood specimens will be examined for the presence of identical mutations in the plasma and buffy coat to determine the frequencies of ctmtDNA and CTCs, respectively. This will be accomplished by using the dual cypher sequencing technology of this disclosure, and as described in Example 1, to sensitively monitor multiple mtDNA mutations concurrently. The distribution of CTCs in peripheral blood from patients with varying PSA serum levels and Gleason scores will be determined.

Example 3 High-Resolution Detection of TP53 Mutations

A recent genomics study determined that TP53 is mutated in 96% of high grade serous ovarian carcinoma (HGSC), responsible for two-thirds of all ovarian cancer deaths (Cancer Genome Atlas Research Network, Nature 474:609, 2011), and current models indicate that TP53 loss is an early event in HGSC pathogenesis (Bowtell, Nat. Rev. Cancer 10:803, 2010). Thus, the near universality and early occurrence of TP53 mutations in HGSC make TP53 a promising biomarker candidate for early detection and disease monitoring of HGSC. Dual cypher sequencing of the present disclosure was used to detect somatic TP53 mutations that arose during replication in E. coli.

Dual Cypher Vector Construction

An oligonucleotide containing EcoRI and BamHI restriction enzyme sites, adapter sequences, indices, and random 7-nucleotide barcodes flanking a SmaI restriction enzyme site with the following sequence was made (Integrated DNA Technologies):

(SEQ ID NO.: 2) GATACAGGATCCAATGATACGGCGACCACCGAGATCTACACTAGATCGCGCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNNNNCCCGGGNNNNNNNCTGTCTCTTATACACATCTCTGAGCGGGCTGGCAAGGCAGACCGTAAGGCGAATCTCGTATGCCGTCTTCTGCTTGGAATTCGATACA.
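The layout of SEQ ID NO.:2 around the SmaI site suggests a simple way to recover the two cyphers, and any insert ligated between them, from a read. The following is a minimal sketch: the two flanking 19-nucleotide sequences are copied from SEQ ID NO.:2, whereas the function name `parse_construct`, the regular expression, and the example cypher and insert sequences are hypothetical. The pattern relies on SmaI cutting its CCCGGG site between CCC and GGG, so a blunt-ended insert sits between those two half-sites.

```python
import re

# Sequences immediately flanking the cypher/SmaI/cypher core in SEQ ID NO.:2.
UPSTREAM_FLANK = "AGATGTGTATAAGAGACAG"
DOWNSTREAM_FLANK = "CTGTCTCTTATACACATCT"

# Xa (7 nt), CCC half-site, optional insert, GGG half-site, Xb (7 nt).
CONSTRUCT = re.compile(
    UPSTREAM_FLANK + r"([ACGT]{7})CCC([ACGT]*?)GGG([ACGT]{7})" + DOWNSTREAM_FLANK
)

def parse_construct(read):
    """Return (Xa, insert, Xb) for a read containing the construct, else None."""
    match = CONSTRUCT.search(read)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)

# Hypothetical cyphers and insert, for illustration only.
empty_vector = UPSTREAM_FLANK + "ACGTACG" + "CCCGGG" + "GATTACA" + DOWNSTREAM_FLANK
with_insert = UPSTREAM_FLANK + "ACGTACG" + "CCC" + "TTAGGCAT" + "GGG" + "GATTACA" + DOWNSTREAM_FLANK
print(parse_construct(empty_vector))
print(parse_construct(with_insert))
```

Exact matching of the flanks is a simplification; real reads contain sequencing errors, so a practical implementation would likely tolerate mismatches in the flanking sequences.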

To amplify and create a double-stranded product from this single-stranded DNA oligonucleotide, 30 cycles of PCR were performed using PfuUltra High-Fidelity DNA Polymerase (Agilent Technologies) as per the manufacturer's instructions (forward primer sequence: GATACAGGATCCAATGATACGG, SEQ ID NO.:3; reverse primer sequence: TGTATCGAATTCCAAGCAGAAG, SEQ ID NO.:4). The following cycling conditions were used: 95° C. for 2 minutes, followed by 30 cycles of 95° C. for 1 minute and 64° C. for 1 minute. The double-stranded nature of the product was verified using a SmaI (New England BioLabs) restriction digest. The product was then purified (Zymo Research DNA Clean & Concentrator-5) and subjected to EcoRI/BamHI restriction digest using BamHI-HF (New England BioLabs) and EcoRI-HF (New England BioLabs) to prepare the construct for ligation into an EcoRI/BamHI-digested pUC19 backbone. Digested vector and construct were run on a 1.5% UltraPure Low-Melting Point Agarose (Invitrogen) electrophoresis gel with 1× Sybr Safe (Invitrogen) and the appropriate bands were excised. The DNA in the gel fragments was purified using a Zymo-Clean gel DNA recovery kit (Zymo Research) and quantified using a spectrophotometer (Nanophotometer, Implen). Ligation reactions using T4 DNA ligase HC (Invitrogen) and a 1:3 vector to insert molar ratio at room temperature for 2 hours were carried out, then ethanol precipitated, and resuspended in water. Purified DNA (2 μl) was electroporated into ElectroMAX DH10B T1 Phage Resistant Cells (Invitrogen). The transformed cells were plated at a 1:100 dilution on LB agar media containing 100 μg/mL carbenicillin and incubated overnight at 37° C. to determine colony counts, and the remainder of the transformation was spiked into LB cultures for overnight growth at 37° C. The DNA from the overnight cultures was purified using the QIAquick Spin Miniprep Kit (Qiagen).
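As a consistency check on the primer design, the forward primer should match the 5′ end of the SEQ ID NO.:2 oligonucleotide and the reverse primer should be the reverse complement of its 3′ end, with each primer carrying one of the cloning sites. The sketch below uses the primer sequences and the terminal 22 nucleotides of SEQ ID NO.:2 as given in the text; the helper names `reverse_complement` and `primers_match_template` are hypothetical.

```python
FORWARD_PRIMER = "GATACAGGATCCAATGATACGG"  # SEQ ID NO.:3
REVERSE_PRIMER = "TGTATCGAATTCCAAGCAGAAG"  # SEQ ID NO.:4
OLIGO_5P_END = "GATACAGGATCCAATGATACGG"    # first 22 nt of SEQ ID NO.:2
OLIGO_3P_END = "CTTCTGCTTGGAATTCGATACA"    # last 22 nt of SEQ ID NO.:2

BAMHI_SITE = "GGATCC"
ECORI_SITE = "GAATTC"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def primers_match_template():
    """Check that both primers anneal at the ends of the oligonucleotide."""
    forward_ok = OLIGO_5P_END.startswith(FORWARD_PRIMER)
    reverse_ok = reverse_complement(REVERSE_PRIMER) == OLIGO_3P_END
    return forward_ok and reverse_ok

print(primers_match_template())
print(BAMHI_SITE in FORWARD_PRIMER, ECORI_SITE in REVERSE_PRIMER)
```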

A single next generation sequencing run on MiSeq® demonstrated optimal coverage and diversity at the upstream seven basepair cypher in the vector library. FIG. 3A shows that each nucleotide was detected at approximately the same rate at each random position of the cypher (here the 5′ cyphers were sequenced).
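A per-position diversity check of the kind summarized in FIG. 3A could be computed along the following lines. This is a sketch only: `position_frequencies` is a hypothetical name, the input cyphers are invented for illustration, and the sketch makes no claim about how the figure was actually generated.

```python
from collections import Counter

def position_frequencies(cyphers):
    """Return, for each cypher position, the fraction of A, C, G and T.

    `cyphers` is a non-empty list of equal-length cypher sequences extracted
    from sequencing reads. In a well-randomized library each base would be
    expected at a similar frequency at every position.
    """
    length = len(cyphers[0])
    frequencies = []
    for i in range(length):
        counts = Counter(cypher[i] for cypher in cyphers)
        total = sum(counts.values())
        frequencies.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return frequencies

# Invented two-nucleotide cyphers, perfectly balanced at both positions.
for position, freqs in enumerate(position_frequencies(["AC", "CA", "GT", "TG"]), start=1):
    print(position, freqs)
```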

TP53 Exon 4 Library Construction

Briefly, SKOV-3 (human ovarian carcinoma cell line) cells were grown in McCoy's 5a Medium supplemented with 10% Fetal Bovine Serum, 1.5 mM L-glutamine, 2200 mg/L sodium bicarbonate, and Penicillin/Streptomycin. SKOV-3 cells were harvested and DNA was extracted using a DNeasy Blood and Tissue Kit (Qiagen). PCR primers were designed to amplify exon 4 of human TP53; forward primer sequence: TCTGTCTCCTTCCTCTTCCTACA (SEQ ID NO.:5) and reverse primer sequence: AACCAGCCCTGTCGTCTCT (SEQ ID NO.:6). Thirty cycles of PCR were performed on SKOV-3 DNA using 0.5 μM primers and GoTaq Hot Start Colorless Master Mix (Promega) under the following cycling conditions: 95° C. for 2 minutes; 30 cycles of 95° C. for 30 seconds, 63° C. for 30 seconds, 72° C. for 1 minute; followed by 72° C. for 5 minutes. Each PCR product was then cloned into TOPO vectors (Invitrogen), transformed into One Shot TOP10 Chemically Competent E. coli cells (Invitrogen), plated on LB agar media containing 100 μg/mL carbenicillin and incubated overnight at 37° C.

Ten colonies were picked and cultured overnight. The DNA from the overnight LB cultures was purified using the QIAquick Spin Miniprep Kit (Qiagen). Sequencing of the TOPO clones was performed using capillary electrophoresis-based sequencing on an Applied Biosystems 3730xl DNA Analyzer. One TOPO clone containing the appropriate wild type TP53 exon 4 sequence was selected. The DNA was subjected to EcoRI digestion to excise the TP53 exon 4 insert and run on a 1.5% UltraPure Low-Melting Point Agarose gel. The TP53 exon 4 DNA band was then manually excised and purified using the Zymo-Clean gel DNA recovery kit followed by phenol/chloroform/isoamyl alcohol extraction and ethanol precipitation. The digested DNA was then blunted and phosphorylated using the Quick Blunting Kit (New England BioLabs) and purified with a phenol/chloroform/isoamyl alcohol extraction and ethanol precipitation.

The Cypher Seq vector library was digested with SmaI, treated with Antarctic Phosphatase (New England BioLabs), and run on a 1.5% UltraPure Low-Melting Point Agarose gel. The appropriate band was excised and purified using the Zymo-Clean gel DNA recovery kit, followed by phenol/chloroform/isoamyl alcohol extraction and ethanol precipitation. Blunt-end ligations of the vector and TP53 exon 4 DNA were then carried out in 20 μl reactions using T4 DNA Ligase HC (Invitrogen) and a 1:10 vector to insert molar ratio. The ligations were incubated at 16° C. overnight, ethanol precipitated, and transformed into ElectroMAX DH10B T1 Phage Resistant Cells. Bacteria were grown overnight at 37° C. in LB containing 100 μg/mL carbenicillin and DNA was purified using the QIAquick Spin Miniprep Kit. The presence of the appropriate insert was verified by diagnostic restriction digest and gel electrophoresis.

The sequencing construct containing the Illumina adapters, barcodes, and TP53 DNA was then amplified using 10 cycles of PCR and primers designed against the adapter ends (forward primer: AATGATACGGCGACCACCGA, SEQ ID NO.:7; and reverse primer: CAAGCAGAAGACGGCATACGA, SEQ ID NO.:8). PCR cycling conditions were as follows: 95° C. for 2 minutes; 10 cycles of 95° C. for 30 seconds, 63° C. for 30 seconds, 72° C. for 1 minute; followed by 72° C. for 5 minutes. The sequencing construct was gel purified (Zymo-Clean gel DNA recovery kit), phenol/chloroform/isoamyl alcohol extracted and ethanol precipitated. The library was quantified using the Quant-iT PicoGreen assay (Invitrogen) before loading onto the Illumina MiSeq® flow cell. Finally, the library was sequenced. Sequencing was performed as instructed by the manufacturer's protocol with MiSeq® at Q30 quality level (Illumina). A Q score is defined as a property that is logarithmically related to the base calling error probability P (Q=−10 log₁₀ P). An assigned Q score of 30 (Q30) for a base means that the probability of an incorrect base call is 1 in 1,000; that is, the base call accuracy (i.e., probability of a correct base call) is 99.9%, which is considered the gold standard for next generation sequencing. Barcodes were used to deconvolute the sequencing data.
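
As a small worked illustration of the Q score relationship stated above (Q=−10 log₁₀ P), the conversion in both directions can be written as follows. The function names are assumptions introduced here for illustration only.

```python
import math

def phred_to_error_prob(q):
    """Error probability P implied by a Q score, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Q score implied by a base calling error probability P."""
    return -10 * math.log10(p)

q30_error = phred_to_error_prob(30)   # probability of an incorrect base call at Q30
q30_accuracy = 1 - q30_error          # probability of a correct base call at Q30
```

For Q30 this gives an error probability of 0.001 and an accuracy of 0.999, matching the 1 in 1,000 and 99.9% figures above.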

Results

TP53 Exon 4 DNA from a dual cypher vector library produced in E. coli was sequenced with a depth of over a million, and all sequencing reads with identical cypher pairs and their reverse complements were grouped into families to create a consensus sequence. As illustrated in FIG. 3B, errors introduced during library preparation (open circle) and during sequencing (gray circle and triangle) were computationally eliminated from the consensus sequence and only mutations present in all reads (black diamonds, FIG. 3B) of a cypher family were counted as true mutations (see bottom of FIG. 3B).
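
The family grouping and consensus step described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the software used to produce FIG. 3B: each read is assumed to be already parsed into a 5′ cypher, a 3′ cypher, and an insert sequence of the same length as the reference and oriented to the reference strand; positions where family members disagree are assumed to revert to the reference base; and all function names and toy sequences are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def family_key(cypher5, cypher3):
    """Return one key for a cypher pair and for its reverse complement,
    so that reads from both strands of a molecule fall in the same family."""
    forward = (cypher5, cypher3)
    reverse = (revcomp(cypher3), revcomp(cypher5))
    return min(forward, reverse)

def group_families(reads):
    """Group (cypher5, cypher3, insert) reads into cypher families."""
    families = defaultdict(list)
    for cypher5, cypher3, insert in reads:
        families[family_key(cypher5, cypher3)].append(insert)
    return families

def family_consensus(inserts, reference):
    """Keep a base only where every read in the family agrees;
    otherwise fall back to the reference base (assumption of this sketch)."""
    consensus = []
    for i, ref_base in enumerate(reference):
        bases = {insert[i] for insert in inserts}
        consensus.append(bases.pop() if len(bases) == 1 else ref_base)
    return "".join(consensus)

# Toy data, made up for illustration. The third read carries the
# reverse-complemented cypher pair, i.e. it comes from the other strand.
reference = "ACGTAC"
reads = [
    ("ACGTTGA", "TTGACCA", "ACTTAC"),  # G>T at index 2
    ("ACGTTGA", "TTGACCA", "ACTTAG"),  # G>T at index 2 plus an isolated error at index 5
    ("TGGTCAA", "TCAACGT", "ACTTAC"),  # other strand, G>T at index 2
]
families = group_families(reads)
consensus_by_family = {
    key: family_consensus(inserts, reference) for key, inserts in families.items()
}
```

In this toy family the change at index 2 appears in every read and is retained, while the change seen in a single read at index 5 is removed from the consensus.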

Wild-type TP53 Exon 4 sequence was compared to the actual sequence results and substitutions were plotted before (FIG. 4A) and after correction with Cypher Seq (FIG. 4B). Prior to correction, the detected error frequency was 3.9×10⁻⁴/bp (FIG. 4A). In short, the initial error frequency reflects assay-related errors (e.g., PCR, sequencing, and other errors introduced after bar-coding). This means that detecting a rare mutation is difficult because the noise-to-signal ratio is very high. After Cypher Seq correction, however, the error frequency dropped to 8.8×10⁻⁷/bp (FIG. 4B). In other words, the remaining substitutions are most likely biological in nature and most likely reflect errors introduced during replication in E. coli prior to ligation into the barcoded vectors. Thus, true mutations (i.e., those that arise naturally in a cell during replication) are readily detectable using the cypher system of the instant disclosure.
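
The size of the reduction reported above can be checked with simple arithmetic. The two error frequencies are taken from the text; the variable names and the per-million-bases framing are introduced here only for illustration.

```python
raw_error_per_bp = 3.9e-4        # detected error frequency before correction (FIG. 4A)
corrected_error_per_bp = 8.8e-7  # error frequency after Cypher Seq correction (FIG. 4B)

# Fold reduction in error frequency achieved by the correction.
fold_reduction = raw_error_per_bp / corrected_error_per_bp

# Expected number of erroneous base calls per million sequenced bases.
raw_errors_per_mb = raw_error_per_bp * 1_000_000
corrected_errors_per_mb = corrected_error_per_bp * 1_000_000
```

The ratio works out to roughly 440-fold, i.e. about 390 apparent substitutions per million bases before correction versus fewer than one after correction.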

The various embodiments described above can be combined to provide further embodiments. All of the U.S. patents, U.S. patent application publications, U.S. patent applications, foreign patents, foreign patent applications and non-patent publications referred to in this specification and/or listed in the Application Data Sheet are incorporated herein by reference, in their entirety. In general, in the following claims, the terms used should not be construed to limit the claims to specific embodiments disclosed in the specification and claims, but should be construed to include all possible embodiments along with the full scope of equivalents to which such claims are entitled. Accordingly, the claims are not limited by the disclosure.

1.-38. (canceled)
 39. A method comprising: (a) providing a plurality of circulating DNA molecules obtained from a patient sample; (b) ligating the circulating DNA molecules to cypher polynucleotides to form double-stranded cypher-target nucleic acid complexes, wherein: (i) the cypher polynucleotides comprise identifier tags selected from a plurality of distinct identifier tag sequences; (ii) at least two of the identifier tags are identical in sequence and are ligated to different circulating DNA molecules, thereby non-uniquely tagging the different circulating DNA molecules; and (iii) an identifier tag alone or in combination with an end of a circulating DNA molecule uniquely identifies a cypher-target nucleic acid complex; (c) amplifying the cypher-target nucleic acid complexes to produce a corresponding plurality of cypher-target amplification products; (d) sequencing the cypher-target amplification products to produce a plurality of sequencing reads; (e) grouping the sequencing reads into groups, each of the groups comprising the same identifier tag sequence and the same circulating DNA end sequences, wherein each of the groups comprises sequencing reads from the cypher-target amplification products of one of the cypher-target nucleic acid complexes; and (f) comparing the sequencing reads within the groups, and generating error-corrected sequences of the circulating DNA molecules by distinguishing erroneous nucleotides in one strand that lack a matched base change in the complementary strand.
 40. The method of claim 39, further comprising identifying one or more single nucleotide mutations.
 41. The method of claim 39, wherein the ligating comprises ligating to an overhang or a blunt end.
 42. The method of claim 39, further comprising detecting mutations in one or more of the error-corrected sequences as compared to a reference sequence.
 43. The method of claim 39, wherein sequencing the cypher-target amplification products comprises converting data from a sequencing instrument into quality scores and then into sequencing reads.
 44. The method of claim 39, further comprising purifying a plurality of cypher-target nucleic acid complexes prior to sequencing, wherein the purified cypher-target nucleic acid complexes comprise nucleic acid molecules from specific genomic regions.
 45. The method of claim 39, wherein the plurality of circulating DNA molecules comprise a mutation present at a frequency of 2.1×10⁻⁶ or lower.
 46. The method of claim 39, wherein generating the error corrected sequences results in a measurable sequencing error rate from about 10⁻⁶ to about 10⁻⁸.
 47. The method of claim 39, wherein the circulating DNA molecules comprise plasma DNA biomarkers.
 48. The method of claim 39, wherein each identifier tag of the plurality of distinct identifier tag sequences is a random or partially random sequence of about 5 nucleotides in length.
 49. The method of claim 39, wherein each identifier tag of the plurality of distinct identifier tag sequences is a random or partially random sequence of 5 or 6 nucleotides in length.
 50. The method of claim 39, wherein the cypher polynucleotides comprising the identifier tags are contained within a pool of cypher polynucleotides comprising known sequences.
 51. The method of claim 39, wherein the ligating comprises ligating identifier tags to both ends of the circulating DNA molecules, and further wherein the identifier tags at both ends together form a unique pair of identifiers that differ between each of the other pairs of identifiers ligated to the circulating DNA molecules.
 52. The method of claim 39, wherein grouping sequencing reads is based on (i) the identifier tag sequence and (ii) sequence information from an end of the circulating DNA molecule.
 53. The method of claim 39, wherein: (i) the plurality of cypher-target amplification products comprises amplification products from first strands and complementary second strands of the cypher-target nucleic acid complexes; (ii) the plurality of sequencing reads comprises a plurality of first-strand sequencing reads and a plurality of second-strand sequencing reads; and (iii) the comparing comprises comparing the first-strand sequencing reads with the second-strand sequencing reads within the groups.
 54. A method comprising: (a) ligating cypher polynucleotides to circulating DNA molecules obtained from a patient sample to form double-stranded cypher-target nucleic acid complexes, wherein: (i) the cypher polynucleotides comprise identifier tags selected from a plurality of distinct identifier tag sequences; (ii) at least two of the identifier tags are identical in sequence and are ligated to different circulating DNA molecules, thereby non-uniquely tagging the different circulating DNA molecules; and (iii) an identifier tag alone or in combination with an end of a circulating DNA molecule uniquely identifies a cypher-target nucleic acid complex; (b) amplifying the cypher-target nucleic acid complexes to produce a corresponding plurality of cypher-target amplification products; (c) sequencing the cypher-target amplification products to produce a plurality of sequencing reads; (d) grouping the sequencing reads based on (i) the identifier tag sequence and (ii) sequence information from the circulating DNA molecule, wherein a group comprises sequencing reads from the cypher-target amplification products of one of the cypher-target nucleic acid complexes; and (e) comparing the sequencing reads within the groups, and generating error-corrected sequences of the circulating DNA molecules by distinguishing erroneous nucleotides in one strand that lack a matched base change in the complementary strand.
 55. The method of claim 54, further comprising purifying a plurality of cypher-target nucleic acid complexes prior to sequencing, wherein the purified cypher-target nucleic acid complexes comprise nucleic acid molecules from specific genomic regions.
 56. The method of claim 54, further comprising identifying one or more single nucleotide mutations.
 57. The method of claim 54, wherein the circulating DNA molecules comprise blood biomarkers.
 58. The method of claim 54, wherein the circulating DNA molecules comprise DNA molecules derived from cancer cells.
 59. The method of claim 54, wherein each identifier tag of the plurality of distinct identifier tag sequences is a random or partially random sequence of about 5 nucleotides in length.
 60. The method of claim 54, wherein each identifier tag of the plurality of distinct identifier tag sequences is a random or partially random sequence of 5 or 6 nucleotides in length.
 61. The method of claim 54, wherein: (i) the plurality of cypher-target amplification products comprises amplification products from first strands and complementary second strands of the cypher-target nucleic acid complexes; (ii) the plurality of sequencing reads comprises a plurality of first-strand sequencing reads and a plurality of second-strand sequencing reads; and (iii) the comparing comprises comparing the first-strand sequencing reads with the second-strand sequencing reads within the groups.
 62. The method of claim 53, further comprising identifying one or more single nucleotide mutations.
 63. The method of claim 53, wherein the ligating comprises ligating to an overhang or a blunt end.
 64. The method of claim 53, further comprising detecting mutations in one or more of the error-corrected sequences as compared to a reference sequence.
 65. The method of claim 53, wherein sequencing the cypher-target amplification products comprises converting data from a sequencing instrument into quality scores and then into sequencing reads.
 66. The method of claim 53, further comprising purifying a plurality of cypher-target nucleic acid complexes prior to sequencing, wherein the purified cypher-target nucleic acid complexes comprise nucleic acid molecules from specific genomic regions.
 67. The method of claim 53, wherein the plurality of circulating DNA molecules comprise a mutation present at a frequency of 2.1×10⁻⁶ or lower.
 68. The method of claim 53, wherein generating the error corrected sequences results in a measurable sequencing error rate from about 10⁻⁶ to about 10⁻⁸.
 69. The method of claim 53, wherein the circulating DNA molecules comprise plasma DNA biomarkers.
 70. The method of claim 53, wherein each identifier tag of the plurality of distinct identifier tag sequences is a random or partially random sequence of about 5 nucleotides in length.
 71. The method of claim 53, wherein each identifier tag of the plurality of distinct identifier tag sequences is a random or partially random sequence of 5 or 6 nucleotides in length.
 72. The method of claim 53, wherein the cypher polynucleotides comprising the identifier tags are contained within a pool of cypher polynucleotides comprising known sequences.
 73. The method of claim 53, wherein the ligating comprises ligating identifier tags to both ends of the circulating DNA molecules, and further wherein the identifier tags at both ends together form a unique pair of identifiers that differ between each of the other pairs of identifiers ligated to the circulating DNA molecules.
 74. The method of claim 53, wherein grouping sequencing reads is based on (i) the identifier tag sequence and (ii) sequence information from an end of the circulating DNA molecule.
 75. The method of claim 61, further comprising purifying a plurality of cypher-target nucleic acid complexes prior to sequencing, wherein the purified cypher-target nucleic acid complexes comprise nucleic acid molecules from specific genomic regions.
 76. The method of claim 61, further comprising identifying one or more single nucleotide mutations.
 77. The method of claim 61, wherein the circulating DNA molecules comprise blood biomarkers.
 78. The method of claim 61, wherein the circulating DNA molecules comprise DNA molecules derived from cancer cells.
 79. The method of claim 61, wherein each identifier tag of the plurality of distinct identifier tag sequences is a random or partially random sequence of about 5 nucleotides in length.
 80. The method of claim 61, wherein each identifier tag of the plurality of distinct identifier tag sequences is a random or partially random sequence of 5 or 6 nucleotides in length.